Abstract

Relationship-aware sequential pattern mining is the problem 
of mining frequent patterns in sequences in which the events 
of a sequence are mutually related by one or more concepts 
from some respective hierarchical taxonomies, based on 
the type of the events. Additionally, the events themselves
are described by a certain number of taxonomical
concepts. We present RaSP, an algorithm that is able to
mine relationship-aware patterns over such sequences. RaSP
follows a two-stage approach. In the first stage it mines
for frequent type patterns and all their occurrences within
the different sequences. In the second stage it performs
hierarchical mining where, for each frequent type pattern and
its occurrences, it mines for more specific frequent patterns
in the lower levels of the taxonomies. We test RaSP on a
real-world medical application, which provided the inspiration
for its development, in which we mine for frequent patterns
of medical behavior in the antibiotic treatment of microbes,
and show that it has very good computational performance
given the complexity of the relationship-aware sequential
pattern mining problem.

Keywords: Sequential pattern mining; relations; 
taxonomies. 

1 Introduction 

Frequent pattern mining is one of the most popular
paradigms in data mining. One of the most well-studied
problems is that of frequent itemset mining, in which
we mine for sets of items that often appear together,
IH [12]. To address the huge number of patterns 
that these methods generate, different formulations of 
the itemset mining problem have been proposed such as 
mining for closed or maximal frequent patterns [13], [3].
Other forms of frequent pattern mining include mining 
for frequent sequences in which item order is important, 
mining in the presence of item taxonomies where the 
items are described in terms of concept hierarchies, 
e.g. GSP [HJ [IB]- In this paper we will focus on 
sequential pattern mining in the presence of taxonomies 
where in addition the items — events — within a sequence 
are related. In standard frequent sequence mining 
with taxonomies there can be a considerable loss of 
information in the abstract patterns, namely how items
relate. For example, a frequent market basket pattern 
which states that many people buy some product A and 
then buy a second product B one week later, where A 
and B are abstract concepts from the same taxonomy 
node, is not particularly informative. However, we know
much more if we also know that A == B, i.e. these
persons buy the same product again, regardless of which
product it is at the base level. The relationships
between events that we will introduce and mine over
are not limited to equality/difference relationships;
they can be general discrete multi-valued relationships,
possibly multilevel ones that are also described by taxonomies of
concepts. We present a novel algorithm, RaSP, that is
able to mine for relationship-aware sequential patterns
in the presence of taxonomies describing the events and
their relationships. The algorithm has two stages. In
the first stage we mine for frequent patterns of types
and all their occurrences within each sequence of some
database of sequences. In the second stage we refine
these type patterns using the taxonomical information
of the events and relationships. The rest of the paper is
structured as follows: in section [2] we give a description
of the relationship-aware sequences over which we will
be mining and define the problem of relationship-aware
sequential pattern mining; in section [3] we describe
our mining algorithm, and discuss its computational
complexity in section [4]; in section [5] we explore the
behavior of our algorithm on a medical problem
in which we look for patterns of medical behavior; in
section [6] we discuss the related work and we conclude
in section [7].

2 Formal Definition 

A sequence $\Sigma$ is an ordered list of elements $\Sigma_i$ which we
denote as $\Sigma = \langle \Sigma_1 \Sigma_2 \cdots \Sigma_n \rangle$. An element, $\Sigma_i$, can be of
two kinds: either an event $e_i$ or a transaction separator,
denoted by ";", which denotes a passing of time. Thus an
example of a sequence $\Sigma$ would be:

(2.1) $\Sigma = \langle e_1 e_2 \ldots e_k \,;\, e_{k+2} \ldots e_n \rangle$

By $\Sigma^{(e)}$ we denote the sequence that consists only of
the events of $\Sigma$, i.e. no transaction separators. The
operator $|\Sigma|$ denotes the length of the sequence, i.e. the
number of events in $\Sigma$ plus the number of transaction
separators; the $i$th symbol of the sequence is given by $\Sigma_i$.
Similarly the operator $|\Sigma^{(e)}|$ gives the number of events
of the sequence $\Sigma$ and we can retrieve its $i$th event by
$\Sigma^{(e)}_i$. Additionally we define the function $E(\Sigma, i)$ which
returns the index of the $i$th event in the sequence $\Sigma$,
i.e. $\Sigma_{E(\Sigma,i)} = \Sigma^{(e)}_i$; note that $i$ and $E(\Sigma,i)$ in general
are not the same due to the presence of transaction
separators in the sequence.

An event $e$ has some event type $t$, $t = T(e)$. Each
event type $t$ has an associated array of taxonomies
$X_t = [X_{t_1}, \ldots, X_{t_k}]$ that we call the event type schema.
A taxonomy $X$ is a directed tree representing is-a
relationships defined over the set of concepts $C(X)$ that
are the nodes of the tree. A directed edge from a concept
$c_i$ to a concept $c_j$ denotes that $c_i$ is a generalization of
$c_j$. We say that concept $c_i$ subsumes $c_j$, denoted by
$c_i \models c_j$, if there is a directed path in the tree starting
from $c_i$ and arriving at $c_j$. Additionally, for every pair
of event types, $t_i, t_j$, we have a symmetric relationship
which has a relationship type $\rho_{t_i t_j}$. Similarly to the
event types, any relationship type is associated with an
array of taxonomies $X_{\rho_{t_i t_j}} = [X_{\rho_{t_i t_j}, 1}, \ldots, X_{\rho_{t_i t_j}, q}]$ that
we call the relationship type schema. Note that
event type schemas and relationship type schemas can
be empty. If the length of the relationship type schema is
zero for all relationship types we are in the framework
of frequent sequential pattern mining in the presence
of taxonomies, e.g. GSP; if in addition the event types
have a zero-length event type schema then we are in the
standard frequent sequential pattern mining framework.
It is also possible to include n-ary relations; however, for
simplicity we omit that.

An event $e$ of type $t$ has an associated event-concepts
array $c(e) = [c_1, c_2, \ldots, c_k]$, of the same length
as the taxonomies array of its type, where $c_i \in C(X_{t_i})$;
note here that events of the same type can be associated
with concepts that are found on different abstraction levels
of the respective ontologies. Similarly, for every pair
of events, $e_k, e_l$, in some sequence, with corresponding
event types $t_i$ and $t_j$ which define the respective relationship
type $\rho_{t_i t_j}$, we have an associated relationship-concepts
array $r(e_k, e_l) = [r_1, r_2, \ldots, r_q]$, of the same
length as the relationship type schema of $\rho_{t_i t_j}$, where
$r_m \in C(X_{\rho_{t_i t_j}, m})$. The subsumption operator is also defined
for arrays of concepts provided that these have the
same length and their $k$th elements belong to the same
ontology $X_k$. For any two concept arrays $\mathbf{c}_i, \mathbf{c}_j$, we will
say that $\mathbf{c}_i \models \mathbf{c}_j$ if $\forall k,\ c_{i_k} \models c_{j_k}$. The sequences that
we will be considering in this paper will be relationship-aware
sequences that are fully described by the events
they contain and the relationships between these events
as discussed above.
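
The subsumption test on single concepts and on concept arrays can be illustrated with a small sketch. It is only an illustration under stated assumptions: the `Taxonomy` class, its child-to-parent encoding, the toy concept names, and the convention that a concept subsumes itself (a path of length zero) are not taken from the paper.

```python
class Taxonomy:
    """A rooted tree of concepts stored as child -> parent links (assumed encoding)."""

    def __init__(self, parent):
        # `parent` maps every non-root concept to its direct generalization.
        self.parent = dict(parent)

    def ancestors(self, concept):
        """Yield the concept itself, then each generalization up to the root."""
        while concept is not None:
            yield concept
            concept = self.parent.get(concept)

    def subsumes(self, general, specific):
        """True if a directed path leads from `general` down to `specific`.

        Assumption: a concept subsumes itself (path of length zero).
        """
        return general in self.ancestors(specific)


def subsumes_array(taxonomies, general, specific):
    """Element-wise subsumption for two concept arrays over the same schema."""
    if not (len(taxonomies) == len(general) == len(specific)):
        raise ValueError("concept arrays must have the length of the schema")
    return all(t.subsumes(g, s) for t, g, s in zip(taxonomies, general, specific))


# Hypothetical toy taxonomy: root -> a -> b -> c, and root -> d.
toy = Taxonomy({"a": "root", "b": "a", "c": "b", "d": "root"})
print(toy.subsumes("a", "c"))                                 # a generalizes c
print(subsumes_array([toy, toy], ["a", "root"], ["c", "d"]))  # holds position by position
```

Storing only parent links keeps each subsumption test linear in the depth of the taxonomy, which is usually small.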

A transaction is the sequence of events between
two consecutive transaction separators of a sequence $\Sigma$,
or the complete sequence from the beginning of $\Sigma$ to
the first transaction separator. Events in a transaction
are assumed to be different. They are sorted first
by the lexicographical order of their respective types,
and then, in case of type equality, by their concept
arrays according to the first differing concept. The
concept order is given by a pre- or post-order depth-first
traversal of the respective taxonomy. This ensures that
two semantically identical transactions have the same
transcription. We assume that no empty transactions
exist.

We will define a number of sequence representations
that will be used by our algorithm. We construct
the type-and-concept-aware representation, $\Sigma_a$, of a
sequence $\Sigma$ so that it contains, for each of
its events, $e_k \in \Sigma$, its type, $T(e_k)$, and event-concepts
array, $c(e_k)$, and for each pair of its events, $e_k, e_l$, their
respective relationship-concepts array, $r(e_k, e_l)$. Since
relationships are symmetric we only need to include one
of the two relationship-concepts arrays $r(e_k, e_l)$, $r(e_l, e_k)$
in the concept-aware representation. We include
those $r(e_k, e_l)$ for which $l < k$, i.e. we describe the
relationships of every event only to its preceding events,
and for each $k$ we order the $r(e_k, e_l)$ in ascending
order from $l = 1$ to $l = k - 1$. Thus the representation
of each event in a sequence will now be a complex
structure of the form

$\tilde{e}_k = [T(e_k), c(e_k), r(e_k, e_1), r(e_k, e_2), \ldots, r(e_k, e_{k-1})]$

in which the first element gives the type of the event,
the second is the concept array of the event, and the remaining
ones are the relationship-concepts arrays
of this event with all its previous ones in the sequence.
The type-and-concept-aware representation of
the example sequence given in eq. (2.1) will be: $\Sigma_a =
\langle \tilde{e}_1 \tilde{e}_2 \ldots \tilde{e}_k \,;\, \tilde{e}_{k+1} \ldots \tilde{e}_n \rangle$. In addition to the
type-and-concept-aware representation of a sequence we will
also define its type-aware and concept-aware representations.
The former is the sequence that only contains
the type information of the events and the latter
the sequence constructed from the concatenation of the
elements of the events' event-concepts and relationship-concepts
arrays. For the example sequence given in
eq. (2.1) the type-aware representation will be:

$\Sigma_{ta} = \langle T(e_1) T(e_2) \ldots T(e_k) \,;\, T(e_{k+1}) \ldots T(e_n) \rangle$

and the concept-aware:

$\Sigma_{ca} = \langle \bar{c}(e_1)\, \bar{c}(e_2)\, \bar{r}(e_2, e_1) \ldots \bar{c}(e_k)\, \bar{r}(e_k, e_1) \ldots \bar{r}(e_k, e_{k-1})$
$\quad \bar{c}(e_{k+1})\, \bar{r}(e_{k+1}, e_1) \ldots \bar{r}(e_{k+1}, e_k)$
$\quad \ldots \bar{c}(e_n)\, \bar{r}(e_n, e_1) \ldots \bar{r}(e_n, e_{n-1}) \rangle$

where for an array $\mathbf{v} = [v_1, v_2, \ldots, v_k]$, $\bar{\mathbf{v}}$ returns the
elements of $\mathbf{v}$ in the order in which they appear in $\mathbf{v}$, i.e.

$\bar{\mathbf{v}} = v_1 v_2 \cdots v_k$.

The projection of a sequence $\Sigma$ onto an array of
indices, $\mathbf{v}$, is that subsequence whose elements are the
elements of the original sequence whose indices are
given by $\mathbf{v}$. We will denote the projection of $\Sigma$ onto $\mathbf{v}$
by $\Sigma \perp \mathbf{v}$. Since a projection of a sequence is
also a sequence, it has all the different representations
introduced previously.
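
As an illustration of the operators defined so far, the following sketch implements the events-only view, the index function $E(\Sigma, i)$, the projection, and the type-aware representation on a toy sequence. The encoding is an assumption made for the example: events are `(type, concept)` tuples, the separator is the string `";"`, and all function names are hypothetical.

```python
SEP = ";"  # transaction separator (assumed encoding)


def events_only(seq):
    """Events-only view of a sequence: drop the transaction separators."""
    return [x for x in seq if x != SEP]


def event_index(seq, i):
    """E(seq, i): 1-based position in the full sequence of its i-th event."""
    seen = 0
    for pos, x in enumerate(seq, start=1):
        if x != SEP:
            seen += 1
            if seen == i:
                return pos
    raise IndexError(f"sequence has fewer than {i} events")


def project(seq, indices):
    """Projection of seq onto a vector of 1-based indices."""
    return [seq[i - 1] for i in indices]


def type_aware(seq):
    """Type-aware representation: keep separators, replace events by their type."""
    return [x if x == SEP else x[0] for x in seq]


# Toy sequence: two events, a separator, then a third event.
seq = [("B", "x"), ("T", "y"), SEP, ("T", "z")]
print(event_index(seq, 3))   # the third event sits after the separator
print(type_aware(seq))
```

Projecting the events-only view onto an index vector and projecting the full sequence onto the indices mapped through `event_index` select the same events, which is the equivalence used later when occurrences are projected.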

A relationship-aware sequential pattern, $\Pi$, is defined
in the same manner as a relational sequence with
the difference that, unlike sequences, it can have identical
elements within transactions. Similarly to the
sequence notation, $|\Pi|$ denotes the length of the pattern,
i.e. the number of events plus the number of transaction
separators, $\Pi_i$ is the $i$th element of the pattern, $\Pi^{(e)}$ is
the pattern that consists of only the events of $\Pi$, i.e.
no transaction separators, $\Pi^{(e)}_i$ is its $i$th element, and
$|\Pi^{(e)}|$ is the number of events of the pattern $\Pi$. Moreover,
patterns also have the different types of representations
that we presented for sequences, i.e. type-and-concept-aware,
type-aware and concept-aware. Some
additional comments on notation are also in order. In
general we will denote arrays by bold variables, e.g. $\mathbf{v}$,
and the $k$th element of an array with a normal typeface,
e.g. $v_k$; the notation $\mathbf{v}_k$ will be used to index arrays and
not their elements. For brevity we will refer to
relationship-aware patterns as patterns.

A pattern $\Pi$ matches a sequence $\Sigma$ if and
only if there exists an events index vector $\lambda =
(\lambda_1, \lambda_2, \ldots, \lambda_{|\Pi^{(e)}|})$, whose length is equal to the
number of events in $\Pi$, and $1 \le \lambda_i \le |\Sigma^{(e)}|$, such that:

• $\forall \Pi^{(e)}_i$: $T(\Pi^{(e)}_i) = T(\Sigma^{(e)}_{\lambda_i}) \wedge c(\Pi^{(e)}_i) \models c(\Sigma^{(e)}_{\lambda_i})$
• $\forall (\Pi^{(e)}_i, \Pi^{(e)}_j)$: $i \neq j \Rightarrow r(\Pi^{(e)}_i, \Pi^{(e)}_j) \models r(\Sigma^{(e)}_{\lambda_i}, \Sigma^{(e)}_{\lambda_j})$
• Any pair of events in $\Sigma^{(e)} \perp \lambda$ which are separated
by at least one transaction separator in the sequence $\Sigma$
must also be so in $\Pi$, and vice versa, i.e.
a; b does not match ab, nor the opposite.

Note that to get the events index vector with respect to
the full sequence $\Sigma$ we just need to apply the function
$E(\Sigma, \cdot)$ element-wise on the vector $\lambda$; we will denote this
element-wise application and the resulting index vector
by $E(\Sigma, \lambda)$.

Additionally we define a number of constraints on
the form of the mined patterns, namely a max-gap
constraint, $mg$, and a maximum-projected-length constraint,
$mpl$, by defining constraints on the respective
index vectors $\lambda$. The max-gap constraint is defined
as $\lambda_{i+1} - \lambda_i < mg$ and the maximum-projected-length
constraint as $\lambda_{|\lambda|} - \lambda_1 < mpl$.
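
Both constraints reduce to simple checks on an index vector. The sketch below is a hypothetical helper, not the paper's code; it reads the max-gap constraint as a strict bound on the difference between consecutive indices and the maximum-projected-length constraint as a strict bound on the span between the last and first index.

```python
def satisfies_constraints(lam, mg=None, mpl=None):
    """Check an index vector against optional max-gap and max-projected-length bounds.

    lam: increasing 1-based event indices of one occurrence.
    mg:  every consecutive difference must be strictly below mg (if given).
    mpl: last index minus first index must be strictly below mpl (if given).
    """
    if mg is not None and any(b - a >= mg for a, b in zip(lam, lam[1:])):
        return False
    if mpl is not None and lam and lam[-1] - lam[0] >= mpl:
        return False
    return True


print(satisfies_constraints([1, 2, 4], mg=3))    # gaps are 1 and 2
print(satisfies_constraints([1, 2, 4], mpl=3))   # span is 3, not below 3
```

Both checks are inherited by sub-occurrences obtained by dropping the first or the last index, which is what makes them compatible with level-wise candidate generation.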

Given a database of sequences, the support of a
pattern is the number of sequences it matches. We
will call an occurrence of a pattern $\Pi_l$ the couple $O_{l_k} =
(\Sigma_{l_k}, \lambda_{l_k})$ where $\Sigma_{l_k}$ is some sequence and $\lambda_{l_k}$ is the
event index vector of that pattern in $\Sigma_{l_k}$. Each pattern
is associated with a set of occurrences $O_l = \{O_{l_k} \mid k =
1 \ldots M_l\}$. The cardinality, $|O_l|$, of the set of occurrences
is the number of occurrences of the pattern $\Pi_l$. Clearly
a given sequence can include multiple repetitions of the
same pattern, thus it can be that for different occurrences
$O_{l_i}$ and $O_{l_j}$ of the pattern $\Pi_l$ we have $\Sigma_{l_i} = \Sigma_{l_j}$, in which
case nevertheless the respective index vectors $\lambda_{l_i}$
and $\lambda_{l_j}$ will be different. The number of occurrences of
a pattern is trivially larger than or equal to its support.
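
The distinction between support and number of occurrences can be made concrete with a toy example. The representation of an occurrence as a `(sequence id, index vector)` pair and the function name `support` are assumptions made for this sketch.

```python
def support(occurrences):
    """Number of distinct sequences in which a pattern occurs at least once."""
    return len({seq_id for seq_id, _ in occurrences})


# Toy occurrence set: two occurrences in sequence 0, one in sequence 2.
occs = [(0, (1, 2)), (0, (1, 3)), (2, (2, 4))]
print(len(occs), support(occs))  # number of occurrences vs. support
```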

Given a set of relational sequences $\mathcal{S} =
\{\Sigma_1, \Sigma_2, \ldots, \Sigma_{|\mathcal{S}|}\}$ we define Relationship-aware
Sequential Pattern Mining as the discovery of all
relationship-aware patterns, or a well-defined subset of
them, with support in $\mathcal{S}$ larger than some threshold $\theta$.
In the next section we will describe an algorithm, RaSP,
for mining such patterns.

3 Mining for relationship-aware sequential 
patterns 

We address the problem of mining for relationship-aware
sequential patterns using a two-stage algorithm.
In the first stage we mine for frequent patterns
of types, i.e. we mine over the type-aware representations,
$\Sigma_{ta}$, of the sequences. We will call
such a resulting pattern a type-pattern and denote it by
$\Pi_{ta}$; we use the subscript $ta$ to emphasize that these
patterns are in the type-aware representation, like the
sequences from which they were produced.
For each one of these patterns we
compute its associated set of occurrences. Mining over
the type-aware representation of the sequences gives us
patterns that are defined only over the roots of the
event-type and relationship-type taxonomies. To get
the final refined¹ frequent patterns of each frequent
type-pattern $\Pi_{ta}$, in the second stage of the algorithm
we retrieve for each $\Pi_{ta}$ and its set of occurrences
all the associated sequences. Based on the concept-aware
representation of these sequences we get the
refined patterns by solving one frequent pattern mining
problem for each frequent type-pattern. We should note
here that given a specific type-pattern the associated
concept-aware sequences are all of the same length. A
basic difference of our approach from standard frequent


¹ The refined patterns will be defined over the different levels
of the taxonomies and not just their roots.



pattern mining algorithms is that we need to compute
all the occurrences of a given frequent type-pattern, whereas
standard pattern mining algorithms only look at the
presence (typically the first occurrence) or absence of a pattern in a
sequence. We need the occurrences of each frequent
type-pattern because it is over them that we can compute the
different frequent refinements of a given type-pattern.
If we were limited only to the first occurrence of a
type-pattern in a sequence we would be computing a
distorted and incomplete picture of its refined patterns.

3.1 Mining for type-patterns In this stage of the
algorithm we mine for frequent type-patterns
on the sequence set $\mathcal{S}$ over the type-aware representation
of the sequences. We cannot use a standard off-the-shelf
frequent pattern mining algorithm since, as just
mentioned, these do not discover all occurrences within a
sequence of a given pattern. In order to discover all of
them we need to modify some frequent pattern mining
algorithm. We have opted for a modification of the well
known GSP algorithm, [TO], because it was easier to 
adapt it to the requirements of our problem compared 
to more efficient alternatives such as PrefixSpan, [14]. 

The GSP algorithm is based on the Apriori property,
which states that if a pattern $\Pi$ is frequent then all
its subpatterns are also frequent. GSP works in an
iterative manner: it first finds all frequent patterns of
length $k$ and then, given these patterns, it determines
all possible candidate patterns of length $k + 1$ according
to the Apriori property (candidate generation phase), and
subsequently checks these patterns against the database
to determine their frequency (candidate checking phase).
To adapt GSP so that it also computes the set of
occurrences of each pattern we only need to significantly
modify the candidate checking phase; the candidate
generation hardly changes.

In Algorithm [1] we give a basic algorithm that finds
all occurrences of a pattern $\Pi$ in a sequence $\Sigma$ with no
transaction separators. The main functionality of the
algorithm is delivered by the MGSP_SCC function.
The basic idea is that given an element match, $\Pi_i = \Sigma_j$,
we continue the search with $i \leftarrow i + 1$ and $j \leftarrow j + 1$, i.e.
we account for the specific element match, but also with
only $j \leftarrow j + 1$ while keeping $i$ unchanged, i.e. we ignore
this match. If there is no match, we just increment
$j$ and continue. In Algorithm [2] we give the more
evolved function, MGSP_GCC, for candidate checking
and finding all occurrences for patterns and sequences
that contain transaction separators, which follows the
same lines as its simpler version MGSP_SCC. The
utility function next($Z$, $i$) indicates the next element in
the sequence or pattern $Z$ which is not a transaction
separator.
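
The accept-or-skip recursion for the separator-free case can be sketched in a few lines of Python. This is an illustrative transcription under assumptions, not the authors' implementation: the function name `all_occurrences` is hypothetical, sequences and patterns are plain lists of comparable elements, and index vectors are reported with 1-based positions.

```python
def all_occurrences(pattern, seq):
    """Return every index vector (1-based) at which `pattern` occurs in `seq`.

    Neither argument may contain transaction separators.
    """
    found = []

    def search(i, j, lam):
        if i == len(pattern):          # whole pattern consumed: one occurrence
            found.append(tuple(lam))
            return
        if j == len(seq):              # sequence exhausted: no match on this branch
            return
        if pattern[i] == seq[j]:
            search(i + 1, j + 1, lam + [j + 1])   # account for this match
        search(i, j + 1, lam)                     # ignore this position

    search(0, 0, [])
    return found


print(all_occurrences(list("ab"), list("aabb")))
# [(1, 3), (1, 4), (2, 3), (2, 4)]
```

Because every match spawns both an "accept" and a "skip" branch, the procedure enumerates all embeddings rather than stopping at the first one, which is exactly what the refinement stage requires.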



We implemented certain optimizations to speed up
the algorithm. For each type-pattern to check, as well as
for each sequence in our database, we create a multiset
of the elements in the form of a vector. For instance,
given types a, b, c and d, the vector $m_X$ (where $X$ is a
sequence or a pattern) is equal to $[\#a \in X, \#b \in X, \#c
\in X, \#d \in X]$, where "$\#x \in X$" is the number of times
type $x$ appears in the sequence or pattern $X$. When
checking type-pattern $\Pi$ against sequence $\Sigma_{ta}$, we first
compute the vector $m_d = m_\Sigma - m_\Pi$, and if any element
is negative, we return immediately (no match). We also
pass as a parameter to the algorithm a local vector
which is initialized at the first call to $m_d$, and which is
incremented by 1 in position $i$ any time one advances
over $t_i$ in $\Pi$, and decremented at position $i$ any time one
advances over $t_i$ in $\Sigma_{ta}$. If, while advancing over a
sequence element, the corresponding entry becomes negative,
it is impossible to find a match in the future,
and we return immediately. We also store, for each
candidate of length $k$, the sequence identifiers in which
all its parent frequent patterns of length $k - 1$ have
matched, and test only those sequences.
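
The initial multiset test can be sketched with Python's `collections.Counter`. Only the up-front inclusion check is shown; the running vector that is updated during the recursion is not. The function name `may_match` is hypothetical.

```python
from collections import Counter


def may_match(pattern_types, seq_types):
    """Necessary condition for a match: the sequence holds every type at least
    as many times as the pattern does (multiset inclusion)."""
    need = Counter(pattern_types)
    have = Counter(seq_types)
    # A missing type counts as zero in a Counter, so no key check is needed.
    return all(have[t] >= n for t, n in need.items())


print(may_match(list("aab"), list("abca")))  # enough a's and b's
print(may_match(list("aab"), list("abc")))   # only one a: prune immediately
```

The test is only a filter: passing it does not guarantee a match, since order and transaction boundaries are ignored.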

The max-gap constraint is trivial to implement: it
is a simple check and return, and so is the maximum-projected-length
constraint. However, one needs to
be careful in defining constraints so that the Apriori
property is respected. Concretely, if the occurrence of
a pattern of length $k$ passes the test, all possible $k$
combinations of length $k - 1$ must also pass the test.
An example of an invalid constraint is: "We want at least
2 different event types in a pattern", because it cannot
be satisfied at level one, but can be satisfied at higher
levels.

3.2 From frequent type-patterns to frequent
relational patterns At the end of the first stage of the
algorithm we have a set $P = \{\Pi_{l_{ta}} \mid l = 1 \ldots L\}$ of type-patterns,
where each type-pattern $\Pi_{l_{ta}}$ is associated
with a set of occurrences $O_l = \{O_{l_k} = (\Sigma_{l_k}, \lambda_{l_k}) \mid k =
1 \ldots M_l\}$. In the second stage of the algorithm we
define a frequent itemset mining problem for each type-pattern
$\Pi_{l_{ta}}$ on the basis of its associated occurrence set,
$O_l$; thus we will be solving $|P|$ different frequent itemset
mining problems. The patterns that will be computed
for each of these itemset mining problems will give rise
to the final refined frequent patterns of the original
type-pattern.

Given an occurrence $O_{l_k}$ we define $u_{l_k}$ to be the
projection of the event sequence $\Sigma^{(e)}_{l_k}$ on the events
index vector $\lambda_{l_k}$, i.e. $u_{l_k} = \Sigma^{(e)}_{l_k} \perp \lambda_{l_k}$, which is
also equivalent to the $\Sigma_{l_k} \perp E(\Sigma_{l_k}, \lambda_{l_k})$ projection in
terms of the full sequence. Since all $u_{l_k}$ have been
produced from the same type-pattern $\Pi_{l_{ta}}$ they all


Algorithm 1 Modified GSP Simple Candidate Checking

{$\Pi$: a pattern}
{$\Sigma$: the sequence within which we look for the occurrences of $\Pi$}
$O$ = {the set of occurrences of pattern $\Pi$ in sequence $\Sigma$}

call MGSP_SCC($\Sigma$, $\Pi$)

function MGSP_SCC($\Sigma$, $\Pi$, i=1, j=1, $\lambda$=[])
  {i and j are indices in the pattern $\Pi$ and the sequence $\Sigma$
  respectively; $\lambda$ is the index vector of the pattern $\Pi$ in the
  sequence $\Sigma$.}
  if i > $|\Pi|$ then
    $O = O \cup \{(\Sigma, \lambda)\}$
    return
  end if
  if j > $|\Sigma|$ then
    return {No match}
  end if
  if $\Pi_i$ == $\Sigma_j$ then
    MGSP_SCC($\Sigma$, $\Pi$, i+1, j+1, [$\lambda$, j])
  end if
  MGSP_SCC($\Sigma$, $\Pi$, i, j+1, $\lambda$)



have the same number of elements, which is equal to
the number of events in $\Pi_{l_{ta}}$, i.e. $|\Pi^{(e)}_{l_{ta}}|$. We will
denote with $u_{l_k ca}$ the concept-aware representation of
$u_{l_k}$. Remember that the concept-aware representation
contains both the event and relationship concepts. Let
$c_j$ be the $j$th element (concept) of $u_{l_k ca}$. Since $c_j$ is a
concept it is associated with some ontology $X$ and we
denote by $A(c_j, X)$ the set that contains all its ancestors
in its taxonomy, including itself if it is not the root but
excluding the root. We denote by $\Gamma_{l_k j} = \{(a, j) \mid a \in
A(c_j, X)\}$ the set that describes, for the projection
sequence $u_{l_k}$, which are the ancestor concepts of the
concept $c_j$ that appears in its $j$th position, coupled with
their position information $j$. We define the concept-position
set, $\Gamma_{l_k}$, of the projected sequence $u_{l_k}$ as
$\Gamma_{l_k} = \bigcup_j \Gamma_{l_k j}$, which gives which taxonomy concepts
appear in which positions of $u_{l_k ca}$. Subsequently we
construct the set $\Omega_l$, which contains the union of the
concept-position sets over all the occurrences of the set
$O_l$, $\Omega_l = \bigcup_k \Gamma_{l_k}$, which we index based on some
lexicographical order of its items. We can think of the
set $\Omega_l$ as a vocabulary made up of all couples
of the form (some ancestor concept of $c_j$, $j$), $c_j \in
u_{l_k ca}$, gathered from the projections $u_{l_k}$ that are
produced from all occurrences of the pattern $\Pi_{l_{ta}}$. We
will use this vocabulary to describe the different
occurrences $O_{l_k}$ of the set of occurrences $O_l$ of $\Pi_{l_{ta}}$ by
defining the matrix $\mathbf{O}_l : M_l \times |\Omega_l|$ whose $k$th row
corresponds to the occurrence $O_{l_k}$ of $O_l$ and whose columns
correspond to the different elements of the vocabulary $\Omega_l$. The
$(k, h)$ element of the matrix $\mathbf{O}_l$ will be one if the
occurrence $O_{l_k}$ contains in its concept-position set $\Gamma_{l_k}$ the
(some ancestor concept of $c_j$, $j$) pair that corresponds
to the $h$th element of $\Omega_l$, and zero otherwise.
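
The construction of the concept-position sets, the vocabulary, and the binary occurrence matrix can be sketched as follows. This is a simplified illustration under assumptions: a single taxonomy, given as a hypothetical child-to-parent dictionary, is shared by all positions (in the paper each position has its own taxonomy), and each projected occurrence is given directly as a list of concepts.

```python
def ancestors_no_root(concept, parent):
    """The concept and its ancestors, excluding the root (which has no parent entry)."""
    out = []
    while concept in parent:
        out.append(concept)
        concept = parent[concept]
    return out


def concept_position_set(concepts, parent):
    """All (ancestor concept, position) pairs of one projected occurrence."""
    return {(a, j)
            for j, c in enumerate(concepts, start=1)
            for a in ancestors_no_root(c, parent)}


def occurrence_matrix(projections, parent):
    """Return the sorted vocabulary and the binary occurrence-by-item matrix."""
    gammas = [concept_position_set(p, parent) for p in projections]
    vocab = sorted(set().union(*gammas))
    matrix = [[1 if item in g else 0 for item in vocab] for g in gammas]
    return vocab, matrix


# Toy taxonomy root -> a -> b and root -> c; two occurrences of one type-pattern.
parent = {"a": "root", "b": "a", "c": "root"}
vocab, matrix = occurrence_matrix([["b", "c"], ["a", "c"]], parent)
print(vocab)   # [('a', 1), ('b', 1), ('c', 2)]
print(matrix)  # [[1, 1, 1], [1, 0, 1]]
```

Including every ancestor of a concept is what lets a later itemset pick up a generalization such as `('a', 1)` that is shared by both occurrences even though their base-level concepts differ.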

It is on the matrix representation $\mathbf{O}_l$ of the set of
occurrences $O_l$ that the frequent itemset mining will take
place. We define an itemset $\mathbf{i}$ on $\mathbf{O}_l$ as a vector of
column indices, i.e. $\mathbf{i} = (i_1, i_2, \ldots, i_a)$, $a \le |\Omega_l|$. It is
important to note that an itemset $\mathbf{i}$ actually corresponds
to a highly redundant refinement of the type-pattern
$\Pi_{l_{ta}}$ that contains event and relationship concepts
from different levels of the associated taxonomies ($\Pi_{l_{ta}}$
contains only event types and implicitly relationship
types; highly redundant because it includes all ancestor
concepts of a concept that appears in some position
$j$). In fact, from these itemsets we will construct the
final relationship-aware patterns we are mining for. We
can easily get the subset of occurrences, $O_{l\mathbf{i}} \subseteq O_l$, that
contain a given itemset $\mathbf{i}$ through a column vector, $\mu_{\mathbf{i}}$,
that is created by the element-wise multiplication of the
columns of $\mathbf{O}_l$ that are indexed by $\mathbf{i}$. If $\mu_{\mathbf{i}k} = 1$
then the occurrence $O_{l_k}$ contains the itemset $\mathbf{i}$. However,
unlike standard frequent itemset mining, we do not
measure the support of the itemset $\mathbf{i}$ on the occurrence
set $O_l$ but on the full set of sequences $\mathcal{S}$, since it is the
support in this set that we are interested in. To
do so we need one additional matrix $\Phi_l : M_l \times |\mathcal{S}|$
whose rows correspond again to the occurrences in $O_l$
and whose columns correspond to the different sequences of $\mathcal{S}$. This
matrix simply indicates to which sequence a given
occurrence belongs (all elements of a given row are zero
except the one which corresponds to the sequence to which
the occurrence belongs, which is set to one, since each
occurrence appears in exactly one sequence). So the
support of the itemset $\mathbf{i}$ will be given by the number
of columns (i.e. number of sequences) of the matrix $\Phi_l$
that have at least one common non-zero entry with the
column vector $\mu_{\mathbf{i}}$. We will call the problem of finding
all itemsets $\mathbf{i}$, or a well-defined subset of them such
as closed or maximal itemsets², whose support in $\mathcal{S}$
is larger than some threshold $\theta$ an occurrence/itemset
mining problem. It can be solved by simple adaptations
of standard frequent itemset mining algorithms, e.g.
Apriori. However, frequent itemset mining algorithms
that are based on properties which are not valid in our
case, such as $sup(S) - sup(S \cup \{a\}) - sup(S \cup \{b\}) +
sup(S \cup \{a, b\}) \ge 0$, require more care in their adaptation.
We used a simple adaptation of GenMax [8] (we did not


² A maximal itemset is a frequent itemset that has no superset
that is frequent, i.e. all its supersets have support lower than $\theta$;
a frequent itemset is closed if there is no superset of it with the
same support.



use diffsets), which finds maximal itemsets. 
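
The sequence-level support of an itemset can be sketched as follows. The encoding is an assumption made for the example: the occurrence-by-item matrix is a list of 0/1 rows, and the occurrence-to-sequence indicator matrix is replaced by an equivalent list `seq_of_row` giving the sequence id of each row. The function name is hypothetical.

```python
def itemset_support(itemset_cols, matrix, seq_of_row):
    """Count the distinct sequences having at least one occurrence (row)
    that contains every column of the itemset."""
    sequences = {seq_of_row[k]
                 for k, row in enumerate(matrix)
                 if all(row[h] for h in itemset_cols)}
    return len(sequences)


# Three occurrences: the first two belong to sequence 0, the third to sequence 1.
matrix = [[1, 1, 1],
          [1, 0, 1],
          [1, 1, 0]]
seq_of_row = [0, 0, 1]
print(itemset_support([0, 1], matrix, seq_of_row))  # rows 0 and 2 -> 2 sequences
print(itemset_support([1, 2], matrix, seq_of_row))  # row 0 only   -> 1 sequence
```

In this toy data, columns 1 and 2 each have support 2 while their union has support 1, and the two items are never both absent from every occurrence of a sequence; this is the kind of situation in which counting identities borrowed from transaction-level support stop holding.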

Finally, from the itemset $\mathbf{i}$ we will now construct
the corresponding concept-aware representation, $\Pi^{\mathbf{i}}_{l_{ca}}$,
of the type-pattern $\Pi_{l_{ta}}$. Remember that the itemset
is a set of the form $\mathbf{i} = \{$(some ancestor concept
of $c_j$, $j$) $\mid c_j \in u_{l_k ca}$, $u_{l_k ca}$ derived from some $O_{l_k}\}$.
The concept in the $j$th position of the $\Pi^{\mathbf{i}}_{l_{ca}}$ representation
will be either the taxonomy root, if there exists no
element in $\mathbf{i}$ which contains the index $j$, or it will
be $c$ if $(c, j) \in \mathbf{i}$ and there is no descendant $d$ of $c$
such that $(d, j) \in \mathbf{i}$, i.e. we include the most specific
concept each time. Since we cannot have in some given
position $j$ two concepts which are unrelated to each
other by a descendance relationship, because of the
tree structure of the taxonomies and our definition of
the occurrence/itemset mining problem, the element in
the position $j$ will be unique. We can extend our
approach to include generalization relationships that are
defined over Directed Acyclic Graphs instead of tree-like
taxonomies, in which case we would have multiple
elements in the same position. One way to address this
is by looking for a common ancestor of these concepts,
which assumes that we have a rooted DAG; another
one would be to extend our pattern representation to
allow conjunction of concepts.
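
The step from an itemset to the concepts of the refined pattern can be sketched as follows, again under simplifying assumptions: a single tree taxonomy given as a hypothetical child-to-parent dictionary, an explicit `root` name, and an itemset given directly as a set of `(concept, position)` pairs. The most specific concept at a position is taken to be the deepest one, which is well defined because the concepts at one position lie on a single root-to-leaf path.

```python
def depth(concept, parent):
    """Number of edges between the concept and the root."""
    d = 0
    while concept in parent:
        concept = parent[concept]
        d += 1
    return d


def concepts_from_itemset(itemset, parent, n_positions, root="root"):
    """Most specific concept per position; the root where the itemset is silent."""
    pattern = []
    for j in range(1, n_positions + 1):
        here = [c for c, pos in itemset if pos == j]
        if here:
            pattern.append(max(here, key=lambda c: depth(c, parent)))
        else:
            pattern.append(root)
    return pattern


parent = {"a": "root", "b": "a", "c": "root"}
print(concepts_from_itemset({("a", 1), ("b", 1)}, parent, n_positions=2))
# ['b', 'root']
```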

From the concept-aware representation $\Pi^{\mathbf{i}}_{l_{ca}}$ and
its type-pattern $\Pi_{l_{ta}}$ we can obtain the final pattern
representation $\Pi^{\mathbf{i}}_l$. So finally for each type-pattern
$\Pi_{l_{ta}}$ we will have a set, $R_l$, of frequent relationship-aware
patterns $R_l = \{\Pi^{\mathbf{i}}_l \mid sup(\mathbf{i}) > \theta\}$, and the
complete set, $R$, of frequent relationship-aware patterns
will be given by the union of these sets, $R = \bigcup_l R_l$.

4 Computational Complexity 

The computational complexity of the second stage of the
algorithm is determined by the GenMax algorithm that
we use there and is a multiple of the number of the type-patterns
that were discovered in the type-pattern mining
stage. The type-pattern mining complexity, assuming
we are using the modified GSP described in this paper,
however, has an easily determinable upper bound.
In the case of matching a pattern of size $k$ to a sequence
of size $n > k$, there are at most $\binom{n}{k}$ possible occurrences
to find, and so the theoretical worst-case complexity
is $O\left(\binom{n}{k} k\right)$, which is much larger than the
$O(n)$ needed to determine the presence or the absence
of a pattern in a sequence. However, with a max-gap
constraint of $g$ events, an upper bound on the number of
occurrences becomes $(g-1)^{k-1}(n-k)$, so an upper bound
on the complexity is $O\left((g-1)^{k-1}(n-k)k\right)$ instead of



Algorithm 2 Modified GSP General Candidate Checking

function MGSP_GCC($\Sigma$, $\Pi$, i=1, j=1, $\lambda$=[],
    lastMatchWasTS=True, TS=";")
  if i > $|\Pi|$ then
    $O = O \cup \{(\Sigma, \lambda)\}$
    return
  end if
  if j > $|\Sigma|$ then
    return {No match}
  end if
  {Optional: Do check for multiset inclusion}
  {Optional: Do check for max-gap violation}
  {Optional: Do check for max-projected-length violation}
  def previousInSequenceIsTS = ($\Sigma_{j-1}$ == TS)
  def nextInPatternIsTS = ($\Pi_{i+1}$ == TS)
  if previousInSequenceIsTS && !lastMatchWasTS then
    return {Found a TS in sequence which was not in pattern.}
  end if
  if $\Pi_i$ == $\Sigma_j$ then
    if nextInPatternIsTS then
      nextPos = position of the item just after the next TS in $\Sigma$
      if nextPos == NULL then
        return {No match}
      end if
      MGSP_GCC($\Sigma$, $\Pi$, next($\Pi$,i), nextPos, [$\lambda$, j], True)
    else
      MGSP_GCC($\Sigma$, $\Pi$, next($\Pi$,i), next($\Sigma$,j), [$\lambda$, j], False)
    end if
  end if
  MGSP_GCC($\Sigma$, $\Pi$, i, j+1, $\lambda$, lastMatchWasTS)



the presence-only $O(n^2)$ forward-backward described in
[16]. If we include a max projected length (in terms of
elements) $w$ ($n > w > k$), an upper bound on the number
of occurrences becomes $\binom{w}{k-1}(n-k)$ and so an
upper bound on the complexity is $O\left(\binom{w}{k-1}(n-k)k\right)$.

Even if this worst-case computational complexity may
seem overwhelming, it only actually happens when the
type-pattern $\Pi$ and the type-aware sequence representation
$\Sigma_{ta}$ are equal to $T^k$ and $T^n$ (where $T$ is a single
event type) respectively. However, computational
complexity and combinatorial explosion of occurrences
can be a problem in some cases. Possible remedies are:
adding a max-gap and/or max projected length constraint,
either in terms of event count or in terms of
time; artificially disregarding type-patterns consisting
of $k$ times the same event type in a row (and all of
their children); removing outliers in the dataset, such as
extraordinarily long sequences, or sequences with more
than $q$ events of the same type.
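
To give a feeling for the magnitudes involved, the three bounds on the number of occurrences can be evaluated numerically. The sketch below simply evaluates the expressions stated in this section; the function names and the parameter values ($n = 50$, $k = 5$, $g = 4$, $w = 10$) are arbitrary illustrative choices, not values from our experiments.

```python
from math import comb


def bound_unconstrained(n, k):
    """At most C(n, k) occurrences of a size-k pattern in a size-n sequence."""
    return comb(n, k)


def bound_max_gap(n, k, g):
    """Bound on the number of occurrences under a max-gap constraint of g events."""
    return (g - 1) ** (k - 1) * (n - k)


def bound_max_length(n, k, w):
    """Bound on the number of occurrences under a max projected length of w."""
    return comb(w, k - 1) * (n - k)


n, k = 50, 5
print(bound_unconstrained(n, k))      # 2118760
print(bound_max_gap(n, k, g=4))       # 3645
print(bound_max_length(n, k, w=10))   # 9450
```

Even for these modest sizes the constrained bounds are smaller by more than two orders of magnitude, which is why the constraints are listed first among the remedies.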

Another problem of our approach can be memory
usage: storing all occurrences in main
memory can become prohibitive. A solution to that
problem can be to stream the occurrence data of a
type-pattern to a file (or similar linear storage), and
then release the main memory associated with it. This
should be done once all occurrences of a type-pattern
have been found and it has been determined to be frequent. In addition,
to avoid memory saturation while solving an occurrence/itemset
problem, a few solutions are available.
We can use a sparse representation of occurrence vectors
(our $\mu$ values). Alternatively we can keep only
$k$ occurrences per sequence when the total number of
occurrences is higher than a value $n$. This makes the
algorithm incomplete, but in the case of a small number
of outlying sequences it does not significantly impact
the usefulness of our method, while it significantly
improves performance.

5 Experiments 

We test various aspects of the performance of our
method on a dataset associated with a medical
problem; in fact, this medical problem was the trigger
that led to the development of RaSP. The general
problem that we want to address is the extraction of
patterns of medical practice in the
antibiotic treatment of microbes. We have a dataset
that contains 6659 episodes of care, each with at least
one antibiotic treatment, spanning the nine years from
2002 to 2010. The data have been collected from
the clinical sites that participate in the European project
DebugIT [2]. An episode of care is the sequence of
events that take place during the stay of some patient in
the hospital. Since we are interested in the extraction of
antibiotic treatment patterns, we consider two types of
events. The first event type is B, which
corresponds to the detection (presence) of a microbe
at some moment in time during the episode of care. The
detection is done through a laboratory test; the same
microbe can be detected more than once within the
course of a given episode of care, and each detection is
considered a different event of type B. The second
event type is T, which corresponds to the prescription of
a treatment with a given drug at some moment during
the episode of care. In addition we define the following
relationship types: B × T, B × B, and T × T. The first
relation corresponds to the notion of an antibiogram.
An antibiogram is a laboratory test in which we test the
sensitivity of a detected bug to some drug; if it takes
place, it does so immediately after the detection of the bug.
The two latter relations are simply identity relations
which indicate whether their respective arguments, i.e.
events, are the same or different.

With each event type, B and T, one taxonomy is
associated, i.e. the respective event-type schemata
have a length of one. With the events of type B
the associated taxonomy is NewT from Uniprot, and
more precisely that part of it which has to do with
the bacteria-microbes, [3]. With the events of type T
the associated taxonomy is the Anatomical Therapeutic
Chemical Classification System, ATC, which is used
for the classification of drugs, and more precisely that
part of the taxonomy which is related to antibiotics
(antibacterials for systemic use, with associated ATC code
starting with J01), pQ. In figure 2 we give a snapshot of
the two taxonomies. Some additional statistics on these
two taxonomies are given in table 1. The three
relationships are also associated with one taxonomy each;
here the associated taxonomies are simpler. The
taxonomy associated with B × T is given by the following
prefix tree (Any (Tested (Sensitive, Resistant,
Intermediate), Not-tested)). This taxonomy describes whether,
for a given pair of a drug and a detected bug within
an episode of care, or detected bug and drug (remember
that the relationships are symmetric), an antibiogram
was or was not performed, Tested and
Not-tested respectively. If there has been a test then the
possible values of the relationship are Sensitive,
Intermediate, Resistant, which denote the measured
sensitivity of the bug to the drug. The identity relations for
both B × B and T × T are described by the simple
prefix tree (Any(=, ≠)).
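
The two relationship taxonomies are small enough to be written down
directly. The sketch below stores each prefix tree as a child-to-parent
map and shows how a leaf concept generalizes up to the root. The names
BT_PARENT, IDENTITY_PARENT, generalizations, bt_relation and
identity_relation, as well as the dictionary used to hold antibiogram
results and the labels bug_1, drug_1 and drug_2, are assumptions made
for illustration; the actual data structures of RaSP are not described
here.

```python
# Relationship taxonomies from the text, as child -> parent maps.
BT_PARENT = {
    "Tested": "Any",
    "Not-tested": "Any",
    "Sensitive": "Tested",
    "Resistant": "Tested",
    "Intermediate": "Tested",
}
IDENTITY_PARENT = {"=": "Any", "≠": "Any"}


def generalizations(concept, parent):
    """Return the concept followed by all its ancestors up to the root."""
    path = [concept]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def bt_relation(bug, drug, antibiogram):
    """Leaf concept of the B x T relation for one (bug, drug) pair.

    `antibiogram` is a hypothetical dict {(bug, drug): result}; pairs
    that are absent from it were not tested.
    """
    return antibiogram.get((bug, drug), "Not-tested")


def identity_relation(a, b):
    """Leaf concept of the B x B or T x T identity relation."""
    return "=" if a == b else "≠"


antibiogram = {("bug_1", "drug_1"): "Resistant"}
leaf = bt_relation("bug_1", "drug_1", antibiogram)
path = generalizations(leaf, BT_PARENT)  # ['Resistant', 'Tested', 'Any']
```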

Table 1: Statistics of the two taxonomies that we used

Figure 1: Histogram of # events within episodes of care.

Figure 2: Snapshot of the Taxonomies

Some additional statistics on the available dataset follow.
The total number of different bugs present in the
dataset was 180. The total number of antibiotics that
were tested in the different antibiograms was 67. The
antibiograms are usually done in batches of 20 to 30 
drugs for a given sample (the actual number depends 
on the microbe that is tested), which result in an 
antibiogram profile. The average number of treatments 
and bugs detected per episode of care is 4.5 and 2.5 
respectively. In figure [1] we give the histogram of the 
number of treatments and detected bugs per episode of 
care. 

We investigated the performance of our method
with respect to different values of the Minimum Support,
MS, Maximum Gap, MG, and Maximum Projected
Length, MPL, parameters. For MS we tested the
values 300, 600, 1200, 1500, and 1800, which correspond
to 4.5%, 9.0%, 18%, 22.52%, and 27.03% of the dataset.
For MG we experimented with 1, 2, 3, 4, and 5. For
MPL we tested lengths of 10, 13, 16 and 19. Whenever
we were testing one of these parameters the others
were fixed: MS to 10% and
18% of the dataset (i.e. two settings), and MG and MPL
to ∞, which means that there was no constraint
on them. Finally, in an effort to study how RaSP
scales with different dataset sizes, we tested its performance
with different subsample sizes, namely from 10%
to 100% with a step of 10%; in that experiment the value
of MS was fixed to 10% of the subsample size, that of MG
to three, and that of MPL to ∞. The experiments were performed
in two scenarios: the relationship-only scenario, in which
we did not use the taxonomies of the events, so the patterns
that we were mining for were patterns of relationships
between the event types, and the full scenario, in
which the event taxonomies were also included. For each
experiment we present the number of frequent patterns,
the total computational time, as well as the computational
time of the two stages of the algorithm, i.e. type-pattern
mining and the subsequent step of specification
of the type-patterns using the taxonomies (hierarchical
mining). In the relationship-only scenario the computational
time will be dominated by the type-pattern mining
phase, since the subsequent mining phase which uses
the taxonomies is limited to the small taxonomies of the
relationships. In the full scenario the hierarchical mining
step will dominate the total computational time. We
used a Quad-Core AMD Opteron(TM) Processor
8356, 2.3 GHz, and set the maximum heap size to
50 GB; even so, certain of our experiments (namely
those with a small MS, roughly 10% or less, and no constraints)
raised an out-of-memory exception.
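
The correspondence between absolute and relative minimum support can
be checked with a few lines. The sketch below only uses the dataset
size reported above; the helper names and the choice of rounding a
relative threshold up to the next integer count are assumptions of
ours, not a description of how RaSP handles the parameter.

```python
import math

N_EPISODES = 6659  # number of episodes of care reported in the text


def support_percentage(min_support, n_sequences=N_EPISODES):
    """Absolute minimum support expressed as a percentage of the dataset."""
    return 100.0 * min_support / n_sequences


def support_count(fraction, n_sequences=N_EPISODES):
    """Smallest absolute count reaching `fraction` of the dataset (assumed rounding)."""
    return math.ceil(fraction * n_sequences)


# Percentages for the tested MS values, rounded to one decimal place.
percentages = {ms: round(support_percentage(ms), 1)
               for ms in (300, 600, 1200, 1500, 1800)}
ten_percent = support_count(0.10)  # 666 episodes
```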

5.1 Results We can see that the computational time
of RaSP is affected by the value of the MG constraint,
with low values resulting in low execution time
(top two rows of figure 3), something that is expected
since the search space is considerably smaller for small
MG values. For MG values smaller than or equal to four
the execution time is less than three hours; for MG=5
it is around four hours. At the same time there
is a significant increase in the number of discovered
patterns, which jumps from less than 500 for MG=1 to
the thousands for the other values of MG. The number
of patterns decreases significantly when we change MS
from 10% to 18% for all the different parameters we
examined. The increase of the MS parameter value
results in a significant decrease of computational time
for all the different parameters we experimented with;
for example, for the MG parameter the decrease in
computational time is almost 30-fold, while for MPL it is
smaller, around five-fold. This decrease in computation
time is clearly seen in the experiment in which we test
the MS parameter with no constraints on the MPL and
MG parameters (fifth row from the top in figure 3). Note
here that for the smallest values of the MS parameter,
300 and 600, we got an out of memory error; this is why
the respective results are not included in the figure. In
general, the more constraints we add to the description of
the mining problem the smaller the search space and the
computational complexity (unlike GSP, the complexity
of which increases with the addition of constraints).
However, fewer constraints do not necessarily lead to a larger
number of patterns; see for example the reduction in the
number of patterns with an increasing MPL (fourth row
of figure 3). This happens because with a less constrained
definition we can get more specific patterns; this can
sometimes lead to a decrease of the total number of
patterns given that we mine for maximal patterns for
each type pattern, because a given specific pattern can
be subsumed by many more general ones but we only
report the more specific one due to the maximal pattern
property.
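
The way in which one additional, more specific pattern can reduce the
number of reported patterns can be seen on a toy example. The sketch
below is an illustration under our own assumptions: patterns are
tuples of concepts from the B × T relationship taxonomy, a pattern
subsumes another one when each of its concepts is an ancestor of (or
equal to) the corresponding concept, and only patterns that subsume
no other frequent pattern are reported. All function names are
hypothetical.

```python
PARENT = {
    "Tested": "Any",
    "Not-tested": "Any",
    "Sensitive": "Tested",
    "Resistant": "Tested",
    "Intermediate": "Tested",
}


def is_ancestor_or_self(general, specific, parent=PARENT):
    """Walk up from `specific` until `general` or the root is reached."""
    while specific != general and specific in parent:
        specific = parent[specific]
    return specific == general


def subsumes(general, specific):
    """True if `general` is position-wise at least as general as `specific`."""
    return len(general) == len(specific) and all(
        is_ancestor_or_self(g, s) for g, s in zip(general, specific)
    )


def most_specific(frequent):
    """Drop every pattern that strictly subsumes another frequent pattern."""
    return [p for p in frequent
            if not any(p != q and subsumes(p, q) for q in frequent)]


constrained = [("Tested", "Any"), ("Any", "Resistant")]
relaxed = constrained + [("Tested", "Resistant")]

reported_constrained = most_specific(constrained)  # both patterns remain
reported_relaxed = most_specific(relaxed)          # only the specific one
```

In the relaxed setting the single pattern (Tested, Resistant) replaces
the two more general patterns that subsume it, so the reported count
drops from two to one.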

Examining how the computational complexity
scales with the dataset size (last row of figure 3), we
can see that the computational time of the type-pattern
mining increases roughly linearly with the dataset size,
while for the hierarchical mining part the behavior is
more complex and irregular. The type-pattern mining
part will require more time than the hierarchical mining
part when the minimum support is set to larger values,
since there will be far fewer abstract patterns to
specialize. An additional observation is that the discovered
patterns are dominated by the abstract patterns, which
intuitively makes sense, since the specialization of the
abstract patterns will necessarily lead to patterns that
have lower support. This is made clear when we
compare the number of patterns in the relationship-only
scenario to the number of patterns in the full scenario; the
number of specialized patterns that are added due to
the incorporation of the full event taxonomies is small
compared to the total number of discovered patterns.
Depending on the parameter setting the percentage of
specialized patterns compared to the total number of
frequent patterns for the full scenario ranges between
10% and 30%.

Table 2: Examples of discovered patterns



In table 2 we give a couple of examples of discovered
patterns. The first pattern states that we have some
detected bug which is followed by some prescribed
treatment to which the bug is resistant. In the second
pattern we have two subsequent treatments, which can
be related in any way, followed by
a detected bug of the gammaproteobacteria family
which is resistant to the first treatment and has any
relationship with the second.

6 Related work 

To perform the type-pattern mining stage of the
algorithm we introduced a modified version of the GSP
algorithm which does not rely on the detection of the
presence of a pattern within a sequence but instead returns
all occurrences of that pattern inside the sequence. By
using occurrence itemset mining instead of relying on a
pure Apriori-based approach, as standard GSP does, we
avoid the generation of useless candidate patterns. The
overhead of finding all occurrences of the type-patterns
can often be smaller, depending on the size of the
taxonomies that are used, than generating and checking a
huge number of candidates, most of which will not be
frequent.
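
The difference between testing for presence and returning all
occurrences can be made concrete with a short sketch. It is a
simplification under our own assumptions: sequences are plain lists of
event types, no gap or length constraints are applied, and neither
function reproduces the actual GSP or RaSP implementation. The names
contains and all_occurrences are hypothetical.

```python
def contains(type_sequence, type_pattern):
    """Presence test: is the pattern a subsequence of the sequence?"""
    remaining = iter(type_sequence)
    # `t in remaining` consumes the iterator up to the first match.
    return all(t in remaining for t in type_pattern)


def all_occurrences(type_sequence, type_pattern, start=0):
    """Yield every tuple of increasing indices whose types spell the pattern."""
    if not type_pattern:
        yield ()
        return
    for i in range(start, len(type_sequence)):
        if type_sequence[i] == type_pattern[0]:
            for rest in all_occurrences(type_sequence, type_pattern[1:], i + 1):
                yield (i,) + rest


episode = ["B", "T", "B", "T"]
present = contains(episode, ["B", "T"])                    # True
occurrences = list(all_occurrences(episode, ["B", "T"]))   # [(0, 1), (0, 3), (2, 3)]
```

The presence test stops at the first match, whereas the enumeration
keeps every index tuple so that the later hierarchical stage can look
up the concepts attached to each matched event.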

Figure 3: Computational time of RaSP for various
parameter settings.

Within the data mining community the only work
to our knowledge that discusses the issue of relations
between the items of a sequence is that of [11]. The
authors introduce a framework to handle unary and
binary predicates over events (unary predicates describe
properties and conditions over a single event, binary
predicates describe relations between items), and mine
over a single sequence with no taxonomies. However,
they only provide a sketch of the algorithm that allows
for pattern mining over a sequence that contains binary
predicates and instead focus on sequences that contain
only unary predicates. In the domain of Inductive
Logic Programming, ILP, there are a couple of works
on frequent sequence mining over relational sequences,
MineSeqLog [10], and [7]. In both, mining is performed
over sequences of predicates, where special predicates
are introduced to describe order; both are extensions
of Apriori-like algorithms in the ILP setting. Since
they are ILP approaches, in principle both can model
taxonomical background knowledge as chained lists
of predicates. However, this, aside from increasing
the search space of the candidate generation phase,
will also produce frequent patterns that correspond to
parts of the taxonomy relations without any reference
to events, i.e. useless patterns. In principle this
can be addressed by imposing specific constraints that
the discovered patterns should fulfill; [7] provide the
possibility of defining constraints on the patterns (they
do not discuss mining with taxonomies), however this
can be quite cumbersome. In RaSP, mining over the
taxonomies is naturally integrated in the sequence item
mining. It does not lead to the generation of useless
patterns because the construction of the patterns is
done on the type-patterns, which are defined over
event-types; relationships are refined only in the next stage
of the algorithm. MineSeqLog can be seen as an
extension of frequent relational pattern mining systems,
such as Warmr, [B], to sequences that meaningfully
incorporates the ordering operators. Its candidate
generation mechanism does not have a bias towards
the generation of sequential patterns; as a result it also
produces patterns that are not sequential in nature, as
the example with the taxonomical patterns mentioned
previously has already demonstrated. In RaSP this type
of bias is inherent since we first mine over type-patterns
which are inherently sequential. The system proposed
in [7] does not have such a bias either; however, as already
mentioned, it provides mechanisms through which one
could declare it. In short, both approaches, [10] and
[7], are more like frequent pattern mining approaches
where some of the discovered frequent patterns will
be of sequential nature. RaSP's frequent patterns are
sequential by construction. Additionally we can also
define relationships over the properties of the events
(our concepts in the event concept arrays) by extending
the relationship concept arrays to include relationships
over property pairs and not only over the events. Note
here that our relationships, whether between properties
or events, are not limited to unification, since we
can model any type of explicit symmetric or
anti-symmetric relationship. For example, the difference
between events, or between events' properties, can be described as
a relationship in RaSP but cannot be discovered in
logic-based approaches, since for patterns of the form
a(X),a(Y) it does not necessarily hold that X ≠ Y;
modelling that in a logic-based approach requires
the incorporation of additional predicates, which would
again produce additional meaningless frequent patterns
that contain only these predicates.



7 Conclusion and future work 

We have presented RaSP, an algorithm for mining frequent
patterns from relationship-aware sequences in the
presence of taxonomies that describe not only the events
but also the relationships between them. This is one of
the few systems that are able to do this type of mining
and, unlike its inductive logic programming counterparts,
which are upgrades of relational frequent pattern miners
to sequences, it discovers naturally sequential patterns
and copes naturally with the taxonomic information.
Our approach can be easily generalized to structures
other than sequences, e.g. graphs or trees, with hierarchical
concepts as atoms. Given an algorithm that within
these structures finds frequent patterns and all their
occurrences at the root level of the taxonomy, i.e. our
type-patterns, we can then derive vectors from each
occurrence, and proceed as before to mine the frequent
patterns.
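
One possible reading of "derive vectors from each occurrence" is
sketched below. It assumes that an occurrence is a tuple of node (or
event) identifiers, that each node carries one concept, and that a
relationship concept is available for every pair of nodes in the
occurrence. The function occurrence_vector, the labels dictionary and
the ordering of the vector entries are assumptions made for
illustration and may differ from the layout used by RaSP.

```python
from itertools import combinations


def occurrence_vector(occurrence, event_concept, relation):
    """Flatten one occurrence into event concepts followed by pairwise relation concepts."""
    events = [event_concept(node) for node in occurrence]
    relations = [relation(a, b) for a, b in combinations(occurrence, 2)]
    return tuple(events + relations)


# Hypothetical occurrence of a three-node pattern in a graph or tree.
labels = {"n1": "x", "n2": "y", "n3": "x"}


def same_label(a, b):
    """Identity relation over node labels."""
    return "=" if labels[a] == labels[b] else "≠"


vector = occurrence_vector(("n1", "n2", "n3"), labels.get, same_label)
```

Because the vector only depends on the matched nodes and their
pairwise relations, the same hierarchical mining step could in
principle be applied regardless of the structure in which the
occurrence was found.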
